Predicting World Series Winners

Fall 2016 Jack Limongelli (jal839@stern.nyu.edu)

Introduction

Baseball is America's pastime. It began in 1846 when the Cartwright Knickerbockers lost to the New York Baseball Club in Hoboken, New Jersey. Fast forward eighty-one years, and the 1927 Yankees capture the hearts of millions as they win the team's second World Series title and cement themselves as the single greatest team in the history of baseball. Figures like Babe Ruth and Lou Gehrig become popular culture icons, removing baseball from its wealthy, aristocratic origins and bringing it to the masses. The game has evolved greatly since that historic year. Top-tier players are now paid up to $30 million a year in the hope that they will help capture the Commissioner's Trophy. Even so, in 112 total World Series, six teams have won 62 percent of championships, and one team, the New York Yankees, has won 24 percent. The following question arises: Is there a formula for winning? This project attempts to answer that question.

Winning

There are two sides of baseball: hitting and pitching. Obviously, a team can't win games without scoring runs, but the old saying is that "defense wins championships." I will consider a team's ranking, relative to the other teams, in runs scored and runs allowed. Building on those two categories, I will look at the team's regular season record as a percentage of total games won in order to account for the changing regular season length throughout history. Because these three categories are so fundamental to winning, I will also include two less obvious factors: the payroll of each team and the number of years since the team's previous World Series win. While the latter may not be as informative for historically bad teams, it may offer insight into the six teams mentioned above.
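
To illustrate why a percentage is used instead of a raw win total, here is a minimal sketch. The function name `win_percentage` and the two records are hypothetical, chosen only to show that the same number of wins means something different in seasons of different lengths.

```python
def win_percentage(wins, games):
    """Return wins as a percentage of games played."""
    if games <= 0:
        raise ValueError("games must be positive")
    return 100.0 * wins / games


# Hypothetical records: the same 91 wins in seasons of different lengths
short_season = win_percentage(91, 140)
long_season = win_percentage(91, 162)
print(round(short_season, 1), round(long_season, 1))
```

The same 91 wins come out to 65 percent of a 140-game season but only about 56 percent of a 162-game season.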


In [757]:
# Packages
import pandas as pd
import matplotlib.pyplot as plt

I imported pandas in order to read the spreadsheets (saved as CSV files) that contain my data. Matplotlib will be used to create graphs, charts, and any other figures that help describe my data.

Data

I collected my data from the following sources: http://www.baseball-reference.com, http://www.thebaseballcube.com/extras/payrolls/, http://www.espn.com/mlb/worldseries/history/winners. The first championship officially called the World Series was held in 1903, so that is my starting year. Unfortunately, payroll data only began to be tracked in 1988. However, it may still be useful, considering the game today relies more heavily on signing free agents than on developing young talent, as it did in the past. Because many teams have been added over the years, teams that did not yet exist will have either "0" or "NaN" for those years.


In [758]:
#import wins
wins = '/Users/Jack/Desktop/Final/Wins.csv'
WinsTable = pd.read_csv(wins)
WinsTable


Out[758]:
Year Winner G ARI ATL BAL BOS CHC CHW CIN ... PIT SDP SFG SEA STL TBD TEX TOR WSN None
0 1903 BOS 140 0 41 46 65 59 43 53 ... 65 0 60 0 31 0 0 0 0 0
1 1904 None 154 0 36 42 62 60 58 57 ... 56 0 69 0 49 0 0 0 0 0
2 1905 SFG 154 0 33 35 51 60 60 51 ... 62 0 68 0 38 0 0 0 0 0
3 1906 CHW 154 0 32 49 32 75 60 42 ... 60 0 62 0 34 0 0 0 0 0
4 1907 CHC 154 0 38 45 38 69 56 43 ... 59 0 53 0 34 0 0 0 0 0
5 1908 CHC 154 0 41 54 49 64 57 47 ... 64 0 64 0 32 0 0 0 0 0
6 1909 PIT 153 0 29 40 58 68 51 50 ... 72 0 60 0 35 0 0 0 0 0
7 1910 OAK 154 0 34 31 53 68 44 49 ... 56 0 59 0 41 0 0 0 0 0
8 1911 OAK 154 0 29 29 51 60 50 45 ... 55 0 64 0 49 0 0 0 0 0
9 1912 BOS 154 0 34 34 68 59 51 49 ... 60 0 67 0 41 0 0 0 0 0
10 1913 OAK 154 0 45 37 51 57 51 42 ... 51 0 66 0 33 0 0 0 0 0
11 1914 ATL 154 0 61 46 59 51 45 39 ... 45 0 55 0 53 0 0 0 0 0
12 1915 BOS 154 0 54 41 66 47 60 46 ... 47 0 45 0 47 0 0 0 0 0
13 1916 BOS 154 0 58 51 59 44 58 39 ... 42 0 56 0 39 0 0 0 0 0
14 1917 CHW 154 0 47 37 58 48 65 51 ... 33 0 64 0 53 0 0 0 0 0
15 1918 BOS 129 0 41 45 58 65 44 53 ... 50 0 55 0 40 0 0 0 0 0
16 1919 CIN 140 0 41 48 47 54 63 69 ... 51 0 62 0 39 0 0 0 0 0
17 1920 CLE 154 0 40 49 47 49 62 53 ... 51 0 56 0 49 0 0 0 0 0
18 1921 SFG 154 0 51 53 49 42 40 45 ... 58 0 61 0 56 0 0 0 0 0
19 1922 SFG 154 0 34 60 40 52 50 56 ... 55 0 60 0 55 0 0 0 0 0
20 1923 NYY 154 0 35 48 40 54 45 59 ... 56 0 62 0 51 0 0 0 0 0
21 1924 MIN 154 0 34 48 44 53 43 54 ... 58 0 60 0 42 0 0 0 0 0
22 1925 PIT 154 0 45 53 31 44 51 52 ... 62 0 56 0 50 0 0 0 0 0
23 1926 STL 154 0 43 40 30 53 53 56 ... 55 0 48 0 58 0 0 0 0 0
24 1927 NYY 154 0 39 38 33 55 45 49 ... 61 0 60 0 60 0 0 0 0 0
25 1928 NYY 154 0 32 53 37 59 47 51 ... 55 0 60 0 62 0 0 0 0 0
26 1929 OAK 154 0 36 51 38 64 38 43 ... 57 0 55 0 51 0 0 0 0 0
27 1930 OAK 154 0 45 42 34 58 40 38 ... 52 0 56 0 60 0 0 0 0 0
28 1931 STL 154 0 42 41 40 55 36 38 ... 49 0 56 0 66 0 0 0 0 0
29 1932 NYY 154 0 50 41 28 58 32 39 ... 56 0 47 0 47 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84 1987 MIN 162 0 43 41 48 47 48 52 ... 49 40 56 48 59 0 46 59 56 0
85 1988 LAD 162 0 33 33 55 48 44 54 ... 52 51 51 42 47 0 43 54 50 0
86 1989 OAK 162 0 39 54 51 57 43 46 ... 46 55 57 45 53 0 51 55 50 0
87 1990 CIN 162 0 40 47 54 48 58 56 ... 59 46 52 48 43 0 51 53 52 0
88 1991 MIN 162 0 58 41 52 48 54 46 ... 60 52 46 51 52 0 52 56 44 0
89 1992 TOR 162 0 60 55 45 48 53 56 ... 59 51 44 40 51 0 48 59 54 0
90 1993 TOR 162 0 64 52 49 52 58 45 ... 46 38 64 51 54 0 53 59 58 0
91 1994 None 117 0 58 54 46 42 57 56 ... 45 40 47 42 45 0 44 47 63 0
92 1995 ATL 145 0 62 49 59 50 47 59 ... 40 48 46 54 43 0 51 39 46 0
93 1996 NYY 162 0 59 54 52 47 52 50 ... 45 56 42 52 54 0 56 46 54 0
94 1997 FLA 162 0 62 60 48 42 49 47 ... 49 47 56 56 45 0 48 47 48 0
95 1998 NYY 163 40 65 48 56 55 49 47 ... 42 60 55 47 51 39 54 54 40 0
96 1999 NYY 163 61 63 48 58 41 46 59 ... 48 45 53 48 46 42 58 52 42 0
97 2000 NYY 162 52 59 46 52 40 59 52 ... 43 47 60 56 59 43 44 51 41 0
98 2001 ARI 162 57 54 39 51 54 51 41 ... 38 49 56 72 57 38 45 49 42 0
99 2002 ANA 162 60 62 41 57 41 50 48 ... 44 41 59 57 60 34 44 48 51 0
100 2003 FLA 162 52 62 44 59 54 53 43 ... 46 40 62 57 52 39 44 53 51 0
101 2004 BOS 162 31 59 48 60 55 51 47 ... 44 54 56 39 65 43 55 41 41 0
102 2005 CHW 162 48 56 46 59 49 61 45 ... 41 51 46 43 62 41 49 49 50 0
103 2006 STL 162 47 49 43 53 41 56 49 ... 41 54 47 48 51 38 49 54 44 0
104 2007 BOS 163 55 52 42 59 52 44 44 ... 42 55 44 54 48 40 46 51 45 0
105 2008 PHI 163 50 44 42 58 60 55 45 ... 41 39 44 37 53 60 48 53 36 0
106 2009 NYY 163 43 53 39 58 51 48 48 ... 38 46 54 52 56 52 53 46 36 0
107 2010 SFG 162 40 56 41 55 46 54 56 ... 35 56 57 38 53 59 56 52 43 0
108 2011 STL 162 58 55 43 56 44 49 49 ... 44 44 53 41 56 56 59 50 49 0
109 2012 SFG 162 50 58 57 43 38 52 60 ... 49 47 58 46 54 56 57 45 60 0
110 2013 BOS 163 50 59 52 60 40 39 55 ... 58 47 47 44 60 56 56 45 53 0
111 2014 SFG 162 40 49 59 44 45 45 47 ... 54 48 54 54 56 48 41 51 59 0
112 2015 KCR 162 49 41 50 48 60 47 40 ... 60 46 52 47 62 49 54 57 51 0
113 2016 CHC 162 43 42 55 57 64 48 42 ... 48 42 54 53 53 42 59 55 59 0

114 rows × 34 columns


In [759]:
#import Runs Against
RunsAgainst = '/Users/Jack/Desktop/Final/Runs Against.csv'
RunsAgainstTable = pd.read_csv(RunsAgainst)
RunsAgainstTable


Out[759]:
Year Winner ARI ATL BAL BOS CHC CHW CIN CLE ... PIT SDP SFG SEA STL TBD TEX TOR WSN None
0 1903 BOS NaN 6 3 1 2 7 4 6 ... 3 NaN 1 NaN 8 NaN NaN NaN NaN 0
1 1904 None NaN 7 6 1 2 2 3 2 ... 4 NaN 1 NaN 5 NaN NaN NaN NaN 0
2 1905 SFG NaN 6 6 3 1 1 5 4 ... 3 NaN 2 NaN 7 NaN NaN NaN NaN 0
3 1906 CHW NaN 8 3 8 1 1 5 2 ... 2 NaN 3 NaN 6 NaN NaN NaN NaN 0
4 1907 CHC NaN 8 5 6 1 1 5 3 ... 3 NaN 3 NaN 7 NaN NaN NaN NaN 0
5 1908 CHC NaN 7 3 4 3 2 6 1 ... 4 NaN 2 NaN 8 NaN NaN NaN NaN 0
6 1909 PIT NaN 7 6 5 1 2 5 4 ... 2 NaN 4 NaN 8 NaN NaN NaN NaN 0
7 1910 OAK NaN 7 8 5 1 2 6 7 ... 3 NaN 2 NaN 8 NaN NaN NaN NaN 0
8 1911 OAK NaN 8 8 3 3 2 6 4 ... 2 NaN 1 NaN 7 NaN NaN NaN NaN 0
9 1912 BOS NaN 8 6 1 3 3 5 5 ... 1 NaN 2 NaN 7 NaN NaN NaN NaN 0
10 1913 OAK NaN 6 6 5 4 1 7 2 ... 2 NaN 1 NaN 8 NaN NaN NaN NaN 0
11 1914 ATL NaN 3 6 1 6 5 7 8 ... 1 NaN 4 NaN 1 NaN NaN NaN NaN 0
12 1915 BOS NaN 3 7 2 7 3 5 6 ... 2 NaN 8 NaN 6 NaN NaN NaN NaN 0
13 1916 BOS NaN 1 4 1 5 2 7 7 ... 6 NaN 4 NaN 8 NaN NaN NaN NaN 0
14 1917 CHW NaN 3 7 1 5 2 8 3 ... 7 NaN 1 NaN 5 NaN NaN NaN NaN 0
15 1918 BOS NaN 5 5 1 1 3 6 4 ... 2 NaN 3 NaN 8 NaN NaN NaN NaN 0
16 1919 CIN NaN 7 5 4 2 2 1 3 ... 3 NaN 4 NaN 6 NaN NaN NaN NaN 0
17 1920 CLE NaN 6 5 4 5 3 4 2 ... 3 NaN 2 NaN 7 NaN NaN NaN NaN 0
18 1921 SFG NaN 6 5 1 7 7 3 3 ... 1 NaN 2 NaN 4 NaN NaN NaN NaN 0
19 1922 SFG NaN 7 2 5 5 3 2 7 ... 3 NaN 1 NaN 6 NaN NaN NaN NaN 0
20 1923 NYY NaN 7 2 8 4 3 1 5 ... 3 NaN 2 NaN 5 NaN NaN NaN NaN 0
21 1924 MIN NaN 7 6 5 5 8 1 7 ... 2 NaN 3 NaN 6 NaN NaN NaN NaN 0
22 1925 PIT NaN 6 7 8 5 3 1 5 ... 3 NaN 2 NaN 4 NaN NaN NaN NaN 0
23 1926 STL NaN 7 8 7 1 3 2 2 ... 5 NaN 3 NaN 4 NaN NaN NaN NaN 0
24 1927 NYY NaN 7 8 7 4 2 2 5 ... 3 NaN 6 NaN 5 NaN NaN NaN NaN 0
25 1928 NYY NaN 7 5 6 1 4 5 8 ... 6 NaN 4 NaN 2 NaN NaN NaN NaN 0
26 1929 OAK NaN 6 2 7 2 6 3 3 ... 4 NaN 1 NaN 5 NaN NaN NaN NaN 0
27 1930 OAK NaN 4 6 3 6 5 5 8 ... 7 NaN 3 NaN 2 NaN NaN NaN NaN 0
28 1931 STL NaN 4 7 4 6 8 7 5 ... 5 NaN 1 NaN 2 NaN NaN NaN NaN 0
29 1932 NYY NaN 2 7 8 1 6 5 3 ... 4 NaN 3 NaN 6 NaN NaN NaN NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84 1987 MIN NaN 12 13 11 11 4 9 14 ... 7 10.0 1 7.0 4 NaN 12.0 1.0 6.0 0
85 1988 LAD NaN 12 14 7 10 12 5 8 ... 6 3.0 7 10.0 9 NaN 9.0 6.0 4.0 0
86 1989 OAK NaN 9 7 10 5 12 11 5 ... 9 6.0 3 9.0 4 NaN 8.0 4.0 7.0 0
87 1990 CIN NaN 12 7 4 11 2 1 11 ... 4 6.0 9 5.0 8 NaN 6.0 3.0 2.0 0
88 1991 MIN NaN 3 13 6 12 5 9 9 ... 2 4.0 10 4.0 6 NaN 14.0 1.0 7.0 0
89 1992 TOR NaN 1 3 5 6 9 5 10 ... 3 7.0 9 14.0 4 NaN 12.0 8.0 2.0 0
90 1993 TOR NaN 1 6 3 7 1 12 11 ... 13 11.0 3 4.0 9 NaN 7.0 5.0 5.0 0
91 1994 None NaN 1 1 10 10 2 3 5 ... 12 9.0 5 9.0 13 NaN 14.0 6.0 2.0 0
92 1995 ATL NaN 1 2 6 8 10 4 1 ... 12 9.0 13 7.0 6 NaN 8.0 12.0 5.0 0
93 1996 NYY NaN 1 11 12 7 4 8 1 ... 12 4.0 13 7.0 6 NaN 5.0 6.0 3.0 0
94 1997 FLA NaN 1 1 12 8 10 10 7 ... 9 13.0 11 10.0 5 NaN 9.0 3.0 7.0 0
95 1998 NYY 13.0 1 7 2 11 14 8 5 ... 6 3.0 7 9.0 9 3.0 12.0 4.0 10.0 0
96 1999 NYY 3.0 1 3 1 15 10 4 8 ... 7 6.0 9 12.0 10 13.0 7.0 9.0 13.0 0
97 2000 NYY 5.0 1 12 1 15 7 6 5 ... 12 9.0 4 2.0 7 8.0 14.0 11.0 14.0 0
98 2001 ARI 2.0 1 10 5 4 8 14 9 ... 15 12.0 9 1.0 3 13.0 14.0 6.0 12.0 0
99 2002 ANA 5.0 1 7 3 11 8 13 10 ... 10 14.0 2 5.0 4 14.0 12.0 9.0 8.0 0
100 2003 FLA 5.0 9 9 8 4 3 15 7 ... 12 13.0 2 1.0 11 11.0 14.0 10.0 8.0 0
101 2004 BOS 14.0 3 9 4 2 10 15 13 ... 9 7.0 12 7.0 1 11.0 5.0 7.0 11.0 0
102 2005 CHW 14.0 5 10 11 7 3 16 1 ... 13 8.0 11 7.0 2 14.0 12.0 6.0 4.0 0
103 2006 STL 7.0 11 13 11 15 10 10 7 ... 9 1.0 8 9.0 5 12.0 8.0 5.0 16.0 0
104 2007 BOS 5.0 6 13 1 2 11 15 3 ... 14 1.0 3 10.0 13 14.0 12.0 2.0 10.0 0
105 2008 PHI 5.0 12 13 4 2 7 13 9 ... 16 10.0 9 11.0 7 2.0 14.0 1.0 15.0 0
106 2009 NYY 14.0 4 14 3 5 2 8 13 ... 11 12.0 1 1.0 3 7.0 4.0 11.0 16.0 0
107 2010 SFG 15.0 3 13 11 13 8 7 12 ... 16 1.0 2 6.0 5 2.0 4.0 9.0 12.0 0
108 2011 STL 8.0 3 14 9 14 7 12 10 ... 11 4.0 2 4.0 9 1.0 5.0 11.0 7.0 0
109 2012 SFG 9.0 4 8 12 14 6 1 14 ... 7 11.0 6 3.0 5 1.0 9.0 11.0 2.0 0
110 2013 BOS 12.0 1 9 6 10 10 4 7 ... 2 13.0 11 12.0 5 5.0 4.0 13.0 6.0 0
111 2014 SFG 14.0 3 3 11 13 13 5 7 ... 9 2.0 6 1.0 4 5.0 14.0 9.0 1.0 0
112 2015 KCR 9.0 13 7 14 4 10 12 2 ... 3 10.0 6 11.0 1 4.0 13.0 5.0 7.0 0
113 2016 CHC 15.0 11 9 3 1 9 13 2 ... 9 10.0 4 6.0 7 8.0 13.0 1.0 2.0 0

114 rows × 33 columns


In [760]:
#import Runs 
Runs = '/Users/Jack/Desktop/Final/Runs.csv'
RunsTable = pd.read_csv(Runs)
RunsTable


Out[760]:
Year Winner ARI ATL BAL BOS CHC CHW CIN CLE ... PIT SDP SFG SEA STL TBD TEX TOR WSN None
0 1903 BOS NaN 7 7 1 4 6 2 2 ... 1 NaN 3 NaN 8 NaN NaN NaN NaN 0
1 1904 None NaN 8 7 2 5 3 2 1 ... 3 NaN 1 NaN 4 NaN NaN NaN NaN 0
2 1905 SFG NaN 8 7 4 5 2 2 5 ... 4 NaN 1 NaN 6 NaN NaN NaN NaN 0
3 1906 CHW NaN 8 5 8 1 3 4 1 ... 3 NaN 2 NaN 7 NaN NaN NaN NaN 0
4 1907 CHC NaN 6 5 8 2 3 4 6 ... 1 NaN 2 NaN 8 NaN NaN NaN NaN 0
5 1908 CHC NaN 4 4 3 2 5 6 2 ... 3 NaN 1 NaN 8 NaN NaN NaN NaN 0
6 1909 PIT NaN 8 7 3 2 6 4 5 ... 1 NaN 3 NaN 5 NaN NaN NaN NaN 0
7 1910 OAK NaN 8 8 3 2 7 6 5 ... 4 NaN 1 NaN 5 NaN NaN NaN NaN 0
8 1911 OAK NaN 4 8 6 1 3 5 4 ... 3 NaN 2 NaN 6 NaN NaN NaN NaN 0
9 1912 BOS NaN 4 8 1 2 6 7 5 ... 3 NaN 1 NaN 6 NaN NaN NaN NaN 0
10 1913 OAK NaN 5 7 3 1 8 6 2 ... 4 NaN 3 NaN 8 NaN NaN NaN NaN 0
11 1914 ATL NaN 2 7 3 5 8 7 5 ... 8 NaN 1 NaN 6 NaN NaN NaN NaN 0
12 1915 BOS NaN 3 8 3 5 2 8 7 ... 6 NaN 3 NaN 1 NaN NaN NaN NaN 0
13 1916 BOS NaN 4 4 6 5 3 6 2 ... 7 NaN 1 NaN 8 NaN NaN NaN NaN 0
14 1917 CHW NaN 5 8 4 4 1 2 3 ... 8 NaN 1 NaN 6 NaN NaN NaN NaN 0
15 1918 BOS NaN 7 7 4 1 6 2 1 ... 4 NaN 3 NaN 5 NaN NaN NaN NaN 0
16 1919 CIN NaN 6 6 5 8 1 2 2 ... 5 NaN 1 NaN 7 NaN NaN NaN NaN 0
17 1920 CLE NaN 8 3 7 5 4 4 1 ... 7 NaN 1 NaN 2 NaN NaN NaN NaN 0
18 1921 SFG NaN 3 4 7 5 6 7 2 ... 4 NaN 1 NaN 2 NaN NaN NaN NaN 0
19 1922 SFG NaN 8 1 8 4 6 5 3 ... 1 NaN 3 NaN 2 NaN NaN NaN NaN 0
20 1923 NYY NaN 8 6 8 3 5 7 1 ... 2 NaN 1 NaN 6 NaN NaN NaN NaN 0
21 1924 MIN NaN 8 4 7 5 3 7 5 ... 3 NaN 1 NaN 2 NaN NaN NaN NaN 0
22 1925 PIT NaN 7 2 8 6 5 8 6 ... 1 NaN 5 NaN 2 NaN NaN NaN NaN 0
23 1926 STL NaN 7 6 8 5 5 3 4 ... 2 NaN 6 NaN 1 NaN NaN NaN NaN 0
24 1927 NYY NaN 6 5 8 4 7 7 6 ... 1 NaN 1 NaN 3 NaN NaN NaN NaN 0
25 1928 NYY NaN 8 3 8 4 7 7 6 ... 1 NaN 2 NaN 2 NaN NaN NaN NaN 0
26 1929 OAK NaN 8 4 8 1 7 7 6 ... 2 NaN 3 NaN 5 NaN NaN NaN NaN 0
27 1930 OAK NaN 7 6 8 2 7 8 4 ... 5 NaN 3 NaN 1 NaN NaN NaN NaN 0
28 1931 STL NaN 8 5 8 1 6 7 2 ... 6 NaN 3 NaN 2 NaN NaN NaN NaN 0
29 1932 NYY NaN 7 6 8 4 7 8 3 ... 5 NaN 2 NaN 6 NaN NaN NaN NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84 1987 MIN NaN 5 13 4 8 11 3 12 ... 7 10.0 3 10.0 2 NaN 5.0 3.0 6.0 0
85 1988 LAD NaN 12 14 1 3 13 5 10 ... 4 10.0 2 11.0 11 NaN 12.0 4.0 6.0 0
86 1989 OAK NaN 11 5 1 1 10 7 14 ... 6 5.0 2 9.0 7 NaN 8.0 3.0 7.0 0
87 1990 CIN NaN 7 11 7 6 9 5 4 ... 2 8.0 4 13.0 11 NaN 10.0 1.0 9.0 0
88 1991 MIN NaN 2 10 7 3 6 4 14 ... 1 9.0 7 9.0 6 NaN 1.0 11.0 12.0 0
89 1992 TOR NaN 3 8 13 10 6 4 11 ... 1 7.0 11 10.0 6 NaN 9.0 2.0 5.0 0
90 1993 TOR NaN 3 6 12 6 7 8 5 ... 10 11.0 2 8.0 4 NaN 3.0 2.0 7.0 0
91 1994 None NaN 5 7 11 11 4 1 1 ... 14 12.0 10 9.0 6 NaN 5.0 10.0 3.0 0
92 1995 ATL NaN 9 9 4 4 5 2 1 ... 11 6.0 8 3.0 14 NaN 11.0 13.0 12.0 0
93 1996 NYY NaN 4 3 4 5 6 2 2 ... 3 6.0 9 1.0 7 NaN 4.0 12.0 11.0 0
94 1997 FLA NaN 3 6 4 12 9 14 3 ... 9 2.0 4 1.0 11 NaN 7.0 14.0 10.0 0
95 1998 NYY 14.0 4 7 3 3 4 7 6 ... 15 8.0 2 5.0 6 14.0 2.0 8.0 16.0 0
96 1999 NYY 1.0 7 8 9 13 10 4 1 ... 12 15.0 3 6.0 10 11.0 2.0 5.0 14.0 0
97 2000 NYY 10.0 6 11 12 11 1 5 2 ... 9 12.0 3 4.0 4 14.0 9.0 8.0 14.0 0
98 2001 ARI 3.0 13 13 7 7 6 12 2 ... 15 6.0 5 1.0 4 14.0 3.0 9.0 14.0 0
99 2002 ANA 1.0 10 13 2 11 3 9 10 ... 15 14.0 3 6.0 2 12.0 5.0 7.0 6.0 0
100 2003 FLA 10.0 1 10 1 9 8 13 13 ... 7 14.0 6 7.0 2 12.0 5.0 2.0 12.0 0
101 2004 BOS 16.0 5 6 1 7 3 10 5 ... 13 8.0 2 14.0 1 13.0 4.0 12.0 14.0 0
102 2005 CHW 10.0 4 10 1 9 9 1 4 ... 14 13.0 15 13.0 3 8.0 3.0 5.0 16.0 0
103 2006 STL 7.0 2 10 6 15 3 9 2 ... 16 13.0 10 13.0 6 14.0 4.0 7.0 10.0 0
104 2007 BOS 14.0 3 9 3 8 14 7 6 ... 12 9.0 15 7.0 11 8.0 5.0 10.0 16.0 0
105 2008 PHI 10.0 6 8 2 1 5 12 6 ... 9 16.0 15 13.0 4 9.0 1.0 11.0 14.0 0
106 2009 NYY 8.0 6 11 3 10 12 11 8 ... 16 15.0 13 14.0 7 5.0 7.0 6.0 9.0 0
107 2010 SFG 8.0 5 13 2 10 7 1 12 ... 16 12.0 9 14.0 6 3.0 4.0 6.0 14.0 0
108 2011 STL 4.0 10 7 1 8 11 2 9 ... 14 15.0 16 14.0 1 8.0 3.0 5.0 12.0 0
109 2012 SFG 4.0 7 9 5 14 4 9 13 ... 10 10.0 6 14.0 2 11.0 1.0 7.0 5.0 0
110 2013 BOS 5.0 4 4 1 14 15 3 4 ... 9 12.0 10 12.0 1 9.0 7.0 8.0 6.0 0
111 2014 SFG 11.0 14 6 11 12 8 13 7 ... 4 15.0 5 11.0 9 15.0 10.0 4.0 3.0 0
112 2015 KCR 2.0 15 7 4 6 15 12 11 ... 4 10.0 5 13.0 11 14.0 3.0 1.0 3.0 0
113 2016 CHC 5.0 14 7 1 2 11 8 2 ... 6 10.0 9 3.0 3 14.0 4.0 5.0 4.0 0

114 rows × 33 columns


In [761]:
#import Payroll
Payroll = '/Users/Jack/Desktop/Final/Payroll.csv'
PayrollTable = pd.read_csv(Payroll)
PayrollTable
# Payroll data is in % of league average


Out[761]:
Year Winner ARI ATL BAL BOS CHC CHW CIN CLE ... SDP SFG SEA STL TBD TEX TOR WSN None Unnamed: 33
0 1988.0 LAD 0.0 89.0 89.0 0.0 110.0 53.0 75.0 70.0 ... 88.0 109.0 59.0 125.0 0.0 54.0 104.0 79.0 0.0 11.17
1 1989.0 OAK 0.0 71.0 61.0 116.0 76.0 57.0 83.0 67.0 ... 98.0 105.0 71.0 116.0 0.0 80.0 120.0 92.0 0.0 13.38
2 1990.0 CIN 0.0 77.0 58.0 107.0 83.0 55.0 85.0 87.0 ... 107.0 120.0 74.0 120.0 0.0 87.0 106.0 96.0 0.0 17.38
3 1991.0 MIN 0.0 86.0 62.0 88.0 113.0 71.0 107.0 77.0 ... 95.0 130.0 68.0 90.0 0.0 93.0 116.0 88.0 0.0 23.77
4 1992.0 TOR 0.0 111.0 71.0 110.0 98.0 95.0 118.0 28.0 ... 92.0 111.0 75.0 90.0 0.0 100.0 147.0 53.0 0.0 29.78
5 1993.0 TOR 0.0 124.0 87.0 138.0 124.0 112.0 139.0 51.0 ... 80.0 112.0 103.0 73.0 0.0 116.0 149.0 48.0 0.0 30.78
6 1994.0 None 0.0 128.0 119.0 117.0 113.0 121.0 126.0 90.0 ... 43.0 127.0 88.0 92.0 0.0 102.0 133.0 60.0 0.0 31.63
7 1995.0 ATL 0.0 142.0 129.0 114.0 102.0 125.0 117.0 111.0 ... 82.0 110.0 108.0 97.0 0.0 102.0 157.0 38.0 0.0 31.78
8 1996.0 NYY 0.0 150.0 154.0 91.0 98.0 133.0 129.0 144.0 ... 86.0 110.0 124.0 123.0 0.0 114.0 90.0 49.0 0.0 31.58
9 1997.0 FLA 0.0 134.0 145.0 105.0 105.0 144.0 115.0 143.0 ... 92.0 89.0 105.0 117.0 0.0 133.0 121.0 49.0 0.0 37.79
10 1998.0 NYY 73.0 149.0 176.0 108.0 123.0 92.0 55.0 147.0 ... 113.0 101.0 130.0 131.0 63.0 137.0 121.0 23.0 0.0 40.07
11 1999.0 NYY 147.0 158.0 149.0 109.0 117.0 52.0 89.0 155.0 ... 97.0 97.0 93.0 97.0 80.0 171.0 101.0 34.0 0.0 47.48
12 2000.0 NYY 138.0 147.0 148.0 144.0 111.0 55.0 79.0 136.0 ... 98.0 95.0 105.0 114.0 115.0 126.0 82.0 60.0 0.0 56.21
13 2001.0 ARI 130.0 141.0 114.0 168.0 99.0 100.0 75.0 142.0 ... 59.0 97.0 114.0 120.0 87.0 135.0 118.0 53.0 0.0 65.43
14 2002.0 ANA 152.0 138.0 90.0 161.0 112.0 85.0 67.0 117.0 ... 61.0 116.0 119.0 111.0 51.0 157.0 114.0 57.0 0.0 67.49
15 2003.0 FLA 114.0 150.0 104.0 141.0 113.0 72.0 84.0 64.0 ... 68.0 117.0 123.0 118.0 28.0 146.0 72.0 73.0 0.0 70.93
16 2004.0 BOS 101.0 131.0 75.0 184.0 131.0 94.0 68.0 50.0 ... 80.0 119.0 118.0 121.0 43.0 80.0 72.0 60.0 0.0 69.04
17 2005.0 CHW 85.0 118.0 101.0 169.0 119.0 103.0 85.0 57.0 ... 87.0 123.0 120.0 126.0 41.0 76.0 63.0 66.0 0.0 73.06
18 2006.0 STL 77.0 116.0 94.0 155.0 122.0 132.0 79.0 72.0 ... 90.0 116.0 113.0 115.0 46.0 88.0 93.0 81.0 0.0 77.56
19 2007.0 BOS 63.0 106.0 113.0 173.0 121.0 132.0 83.0 75.0 ... 70.0 109.0 129.0 109.0 29.0 83.0 99.0 45.0 0.0 82.56
20 2008.0 PHI 74.0 114.0 75.0 149.0 132.0 135.0 83.0 88.0 ... 82.0 86.0 131.0 111.0 49.0 76.0 109.0 61.0 0.0 89.55
21 2009.0 NYY 83.0 109.0 76.0 138.0 152.0 109.0 83.0 92.0 ... 49.0 93.0 112.0 88.0 72.0 77.0 91.0 68.0 0.0 88.51
22 2010.0 SFG 67.0 93.0 90.0 179.0 161.0 119.0 80.0 67.0 ... 42.0 107.0 108.0 103.0 79.0 61.0 69.0 67.0 0.0 91.02
23 2011.0 STL 58.0 94.0 92.0 174.0 135.0 138.0 82.0 53.0 ... 49.0 127.0 93.0 114.0 44.0 99.0 67.0 69.0 0.0 92.87
24 2012.0 SFG 76.0 85.0 83.0 177.0 90.0 99.0 84.0 80.0 ... 56.0 84.0 120.0 113.0 65.0 123.0 77.0 83.0 0.0 98.02
25 2013.0 BOS 85.0 84.0 86.0 150.0 98.0 117.0 104.0 78.0 ... 67.0 134.0 79.0 110.0 54.0 120.0 111.0 106.0 0.0 106.25
26 2014.0 SFG 98.0 96.0 93.0 141.0 77.0 79.0 98.0 72.0 ... 78.0 134.0 80.0 96.0 67.0 118.0 115.0 117.0 0.0 115.13
27 2015.0 KCR 54.0 73.0 97.0 138.0 96.0 91.0 97.0 72.0 ... 104.0 137.0 101.0 99.0 61.0 119.0 95.0 143.0 0.0 121.94
28 2016.0 CHC 75.0 65.0 113.0 152.0 128.0 88.0 71.0 75.0 ... 77.0 131.0 110.0 114.0 49.0 123.0 107.0 112.0 0.0 131.26
29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

30 rows × 34 columns


In [762]:
#import Last WS
LastWS = '/Users/Jack/Desktop/Final/Last WS.csv'
LastWSTable = pd.read_csv(LastWS)
LastWSTable


Out[762]:
Year Winner ARI ATL BAL BOS CHC CHW CIN CLE ... PIT SDP SFG SEA STL TBD TEX TOR WSN None
0 1903.0 BOS 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1904.0 None 0 1 1 0 1 1 1 1 ... 1 0 1 0 1 0 0 0 0 0
2 1905.0 SFG 0 1 1 0 1 1 1 1 ... 1 0 1 0 1 0 0 0 0 0
3 1906.0 CHW 0 2 2 1 2 2 2 2 ... 2 0 0 0 2 0 0 0 0 0
4 1907.0 CHC 0 3 3 2 3 0 3 3 ... 3 0 1 0 3 0 0 0 0 0
5 1908.0 CHC 0 4 4 3 0 1 4 4 ... 4 0 2 0 4 0 0 0 0 0
6 1909.0 PIT 0 5 5 4 0 2 5 5 ... 5 0 3 0 5 0 0 0 0 0
7 1910.0 OAK 0 6 6 5 1 3 6 6 ... 0 0 4 0 6 0 0 0 0 0
8 1911.0 OAK 0 7 7 6 2 4 7 7 ... 1 0 5 0 7 0 0 0 0 0
9 1912.0 BOS 0 8 8 7 3 5 8 8 ... 2 0 6 0 8 0 0 0 0 0
10 1913.0 OAK 0 9 9 0 4 6 9 9 ... 3 0 7 0 9 0 0 0 0 0
11 1914.0 ATL 0 10 10 1 5 7 10 10 ... 4 0 8 0 10 0 0 0 0 0
12 1915.0 BOS 0 0 11 2 6 8 11 11 ... 5 0 9 0 11 0 0 0 0 0
13 1916.0 BOS 0 1 12 0 7 9 12 12 ... 6 0 10 0 12 0 0 0 0 0
14 1917.0 CHW 0 2 13 0 8 10 13 13 ... 7 0 11 0 13 0 0 0 0 0
15 1918.0 BOS 0 3 14 1 9 0 14 14 ... 8 0 12 0 14 0 0 0 0 0
16 1919.0 CIN 0 4 15 0 10 1 15 15 ... 9 0 13 0 15 0 0 0 0 0
17 1920.0 CLE 0 5 16 1 11 2 0 16 ... 10 0 14 0 16 0 0 0 0 0
18 1921.0 SFG 0 6 17 2 12 3 1 0 ... 11 0 15 0 17 0 0 0 0 0
19 1922.0 SFG 0 7 18 3 13 4 2 1 ... 12 0 0 0 18 0 0 0 0 0
20 1923.0 NYY 0 8 19 4 14 5 3 2 ... 13 0 0 0 19 0 0 0 0 0
21 1924.0 MIN 0 9 20 5 15 6 4 3 ... 14 0 1 0 20 0 0 0 0 0
22 1925.0 PIT 0 10 21 6 16 7 5 4 ... 15 0 2 0 21 0 0 0 0 0
23 1926.0 STL 0 11 22 7 17 8 6 5 ... 0 0 3 0 22 0 0 0 0 0
24 1927.0 NYY 0 12 23 8 18 9 7 6 ... 1 0 4 0 0 0 0 0 0 0
25 1928.0 NYY 0 13 24 9 19 10 8 7 ... 2 0 5 0 1 0 0 0 0 0
26 1929.0 OAK 0 14 25 10 20 11 9 8 ... 3 0 6 0 2 0 0 0 0 0
27 1930.0 OAK 0 15 26 11 21 12 10 9 ... 4 0 7 0 3 0 0 0 0 0
28 1931.0 STL 0 16 27 12 22 13 11 10 ... 5 0 8 0 4 0 0 0 0 0
29 1932.0 NYY 0 17 28 13 23 14 12 11 ... 6 0 9 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85 1988.0 LAD 0 30 84 4 79 70 11 39 ... 8 18 33 10 5 0 26 10 18 0
86 1989.0 OAK 1 31 85 5 80 71 12 40 ... 9 19 34 11 6 0 27 11 19 0
87 1990.0 CIN 2 32 86 6 81 72 13 41 ... 10 20 35 12 7 0 28 12 20 0
88 1991.0 MIN 3 33 87 7 82 73 0 42 ... 11 21 36 13 8 0 29 13 21 0
89 1992.0 TOR 4 34 88 8 83 74 1 43 ... 12 22 37 14 9 0 30 14 22 0
90 1993.0 TOR 5 35 89 9 84 75 2 44 ... 13 23 38 15 10 0 31 0 23 0
91 1994.0 None 6 36 90 10 85 76 3 45 ... 14 24 39 16 11 0 32 0 24 0
92 1995.0 ATL 6 36 90 10 85 76 3 45 ... 14 24 39 16 11 0 32 0 24 0
93 1996.0 NYY 7 0 91 11 86 77 4 46 ... 15 25 40 17 12 0 33 1 25 0
94 1997.0 FLA 8 1 92 12 87 78 5 47 ... 16 26 41 18 13 0 34 2 26 0
95 1998.0 NYY 9 2 93 13 88 79 6 48 ... 17 27 42 19 14 0 35 3 27 0
96 1999.0 NYY 10 3 94 14 89 80 7 49 ... 18 28 43 20 15 0 36 4 28 0
97 2000.0 NYY 11 4 95 15 90 81 8 50 ... 19 29 44 21 16 1 37 5 29 0
98 2001.0 ARI 12 5 96 16 91 82 9 51 ... 20 30 45 22 17 2 38 6 30 0
99 2002.0 ANA 0 6 97 17 92 83 10 52 ... 21 31 46 23 18 3 39 7 31 0
100 2003.0 FLA 1 7 98 18 93 84 11 53 ... 22 32 47 24 19 4 40 8 32 0
101 2004.0 BOS 2 8 99 19 94 85 12 54 ... 23 33 48 25 20 5 41 9 33 0
102 2005.0 CHW 3 9 100 0 95 86 13 55 ... 24 34 49 26 21 6 42 10 34 0
103 2006.0 STL 4 10 101 1 96 0 14 56 ... 25 35 50 27 22 7 43 11 35 0
104 2007.0 BOS 5 11 102 2 97 1 15 57 ... 26 36 51 28 0 8 44 12 36 0
105 2008.0 PHI 6 12 103 0 98 2 16 58 ... 27 37 52 29 1 9 45 13 37 0
106 2009.0 NYY 7 13 104 1 99 3 17 59 ... 28 38 53 30 2 10 46 14 38 0
107 2010.0 SFG 8 14 105 2 100 4 18 60 ... 29 39 54 31 3 11 47 15 39 0
108 2011.0 STL 9 15 106 3 101 5 19 61 ... 30 40 0 32 4 12 48 16 40 0
109 2012.0 SFG 10 16 107 4 102 6 20 62 ... 31 41 1 33 0 13 49 17 41 0
110 2013.0 BOS 11 17 108 5 103 7 21 63 ... 32 42 0 34 1 14 50 18 42 0
111 2014.0 SFG 12 18 109 0 104 8 22 64 ... 33 43 1 35 2 15 51 19 43 0
112 2015.0 KCR 13 19 110 1 105 9 23 65 ... 34 44 0 36 3 16 52 20 44 0
113 2016.0 CHC 14 20 111 2 106 10 24 66 ... 35 45 1 37 4 17 53 21 45 0
114 NaN NaN 15 21 112 3 0 11 25 67 ... 36 46 2 38 5 18 54 22 46 0

115 rows × 33 columns

Create Lists for Graphs

Because we are only concerned with World Series winners, we need to remove the data from each table that does not belong to the winner of that season.


In [763]:
# Create new dataframe with just the Year and Winner columns from the original
RunsTableWinner = RunsTable[['Year','Winner']] 
# Convert data frame to a list of just the names of the teams that won each world series
RunsTableWinner2 = (RunsTableWinner.loc[0:113, 'Winner'])
RunsTableWinner3 = list(RunsTableWinner2)
# Locate and store the value corresponding to each winner for a given year. 
# A loop makes the process much easier given the 114 seasons.
RunsTableWinner4 = []
for y in range(0,114):
    RunsTableWinner4.append(RunsTable.loc[y,RunsTableWinner3[y]])
# When the loop data was stored in RunsTableWinner4, the string was repeated several times.  
# To erase the excess data, just the first 114 entries were stored in RunsTableWinner5.
RunsTableWinner5 = RunsTableWinner4[0:114]
# Erase years when the value was 0: no World Series occurred. 
# This occurred in 1904 and 1994 due to business disagreements between the American League and National League and 
# a players' strike, respectively. 
RunsTableWinner6 = [x for x in RunsTableWinner5 if x != 0]
print(RunsTableWinner6)
# The following cells are the exact same process but for the other data sets. 
# All explanations and code are the exact same. Only the names of variables will change.


[1, 1, 3, 2, 2, 1, 2, 1, 1, 1, 2, 3, 6, 1, 4, 2, 1, 1, 3, 3, 5, 1, 1, 1, 1, 2, 2, 2, 1, 4, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 2, 1, 1, 3, 2, 2, 2, 2, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1, 6, 2, 8, 4, 2, 1, 9.0, 2, 1, 2, 1, 3, 1, 1, 4, 4, 1, 2, 4, 5, 7, 1, 13.0, 1.0, 8, 6, 4, 5, 4, 2.0, 2.0, 9, 9, 8.0, 1, 3, 6, 3.0, 4.0, 8.0, 1, 9, 6, 3, 2, 1, 9, 1, 6, 1, 5, 6.0, 2]
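
The per-year lookup above can also be sketched in plain Python. The snippet below is only an illustration, not the code used in this project: the function name `winner_values` is made up, and the sample is a four-team extract of the first four rows of the runs scored table shown earlier.

```python
import csv
import io

# Four-team extract of the first four rows of the runs scored table
# (one row per year, a Winner column, and one column per team)
SAMPLE = """Year,Winner,BOS,CHC,CHW,SFG,None
1903,BOS,1,4,6,3,0
1904,None,2,5,3,1,0
1905,SFG,4,5,2,1,0
1906,CHW,8,1,3,2,0
"""


def winner_values(csv_text):
    """Return the winner's own value for every year that had a winner."""
    values = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        winner = row["Winner"]
        if winner == "None":  # no World Series was played that year
            continue
        # Use the Winner label to pick that team's column in the same row
        values.append(int(row[winner]))
    return values


print(winner_values(SAMPLE))
```

For this extract the function returns [1, 1, 3], which matches the first three entries of the list printed above. Skipping the "None" rows by name, rather than filtering out zeros afterwards, also works for tables where zero is a legitimate value.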

In [764]:
RunsAgainstTableWinner = RunsAgainstTable[['Year','Winner']] 
RunsAgainstTableWinner2 = (RunsAgainstTableWinner.loc[0:113, 'Winner'])
RunsAgainstTableWinner3 = list(RunsAgainstTableWinner2)
RunsAgainstTableWinner4 = []
for y in range(0,114):
    RunsAgainstTableWinner4.append(RunsAgainstTable.loc[y,RunsAgainstTableWinner3[y]])
RunsAgainstTableWinner5 = RunsAgainstTableWinner4[0:114]
RunsAgainstTableWinner6 = [x for x in RunsAgainstTableWinner5 if x != 0]
print(RunsAgainstTableWinner6)


[1, 2, 1, 1, 3, 2, 1, 1, 1, 4, 3, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 3, 4, 1, 2, 1, 2, 2, 2, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 3, 1, 2, 2, 1, 7, 1, 10, 2, 1, 2.0, 9, 5, 2, 3, 1, 3, 5, 1, 1, 3, 5, 2, 1, 10, 1, 2.0, 2.0, 9, 2, 1, 1, 3, 8.0, 5.0, 1, 3, 4.0, 1, 2, 4, 2.0, 1.0, 6.0, 4, 3, 5, 1, 3, 6, 2, 9, 6, 6, 6, 3.0, 1]

In [765]:
WinsTableWinner = WinsTable[['Year','Winner']] 
WinsTableWinner2 = (WinsTableWinner.loc[0:113, 'Winner'])
WinsTableWinner3 = list(WinsTableWinner2)
WinsTableWinner4 = []
for y in range(0,114):
    WinsTableWinner4.append(WinsTable.loc[y,WinsTableWinner3[y]])
WinsTableWinner5 = WinsTableWinner4[0:114]
WinsTableWinner6 = [x for x in WinsTableWinner5 if x != 0]
print(WinsTableWinner6)


[65, 68, 60, 69, 64, 72, 66, 66, 68, 62, 61, 66, 59, 65, 58, 69, 64, 61, 60, 64, 60, 62, 58, 71, 66, 68, 66, 66, 69, 59, 62, 60, 66, 66, 64, 69, 65, 66, 69, 64, 68, 57, 63, 63, 63, 63, 64, 62, 62, 64, 63, 64, 63, 62, 60, 56, 62, 67, 58, 61, 57, 60, 44, 62, 64, 62, 54, 60, 60, 58, 56, 67, 63, 62, 61, 60, 56, 57, 57, 48, 64, 56, 67, 52, 58, 61, 56, 59, 59, 59, 62, 57, 57, 70, 60, 54, 57, 61, 56, 60, 61, 51, 59, 56, 63, 57, 56, 58, 60, 54, 59, 64]

In [766]:
PayrollTableWinner = PayrollTable[['Year','Winner']] 
PayrollTableWinner2 = (PayrollTableWinner.loc[0:113, 'Winner'])
PayrollTableWinner3 = list(PayrollTableWinner2)
# The range is smaller because the payroll data only goes back to 1988
PayrollTableWinner4 = []
for y in range(0,28):
    PayrollTableWinner4.append(PayrollTable.loc[y,PayrollTableWinner3[y]])
PayrollTableWinner5 = PayrollTableWinner4[0:28]
PayrollTableWinner6 = [x for x in PayrollTableWinner5 if x != 0]
print(PayrollTableWinner6)


[138.0, 109.0, 85.0, 94.0, 147.0, 149.0, 142.0, 165.0, 126.0, 158.0, 186.0, 165.0, 130.0, 91.0, 69.0, 184.0, 103.0, 115.0, 173.0, 110.0, 228.0, 107.0, 114.0, 84.0, 150.0, 134.0, 93.0]

In [767]:
LastWSTableWinner = LastWSTable[['Year','Winner']] 
LastWSTableWinner2 = (LastWSTableWinner.loc[0:113, 'Winner'])
LastWSTableWinner3 = list(LastWSTableWinner2)
LastWSTableWinner4 = []
for y in range(0,114):
    LastWSTableWinner4.append(LastWSTable.loc[y,LastWSTableWinner3[y]])
LastWSTableWinner5 = LastWSTableWinner4[0:114]
# To remove the 1904 and 1994 values with no winner I could not just delete zero values because there are many years
# where the same team won consecutively: the second year would have a value of zero years since the team's last win.
# I created 3 individual lists, which excluded 1904 and 1994, and then combined them into one list
a = LastWSTableWinner5[0:1]
b = LastWSTableWinner5[2:91]
c = LastWSTableWinner5[92:114]
LastWSTableWinner6 = a + b + c
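
As an alternative to slicing by position, the seasons without a World Series could be dropped by checking the Winner label. This is a rough sketch under stated assumptions: the function name `drop_seasons_without_series` is hypothetical, and the pairs are the winner and that winner's value from the first six rows of the table above.

```python
# (winner, years since that team's last win) for 1903 through 1908,
# read from the first six rows of the Last WS table
seasons = [
    ("BOS", 0),
    ("None", 0),  # 1904: no World Series
    ("SFG", 1),
    ("CHW", 2),
    ("CHC", 3),
    ("CHC", 0),   # repeat champion: a legitimate zero that must be kept
]


def drop_seasons_without_series(pairs, no_series_label="None"):
    """Keep the value for every season whose winner is a real team."""
    return [value for winner, value in pairs if winner != no_series_label]


print(drop_seasons_without_series(seasons))
```

Filtering on the label keeps the zero for the repeat champion while still removing the season with no winner, and it does not depend on knowing the row positions of 1904 and 1994 in advance.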

In [768]:
# m will be used as the x values
m = range(0,112)
# sets size of the figure
plt.figure(figsize=(15,5))
# plotting of the figure
plt.plot(m, RunsTableWinner6)
# x axis range
plt.xlim(0,113)
# x axis label
plt.xlabel("Years Since 1903")
# y axis range
plt.ylim(0,15)
# y axis label
plt.ylabel('Runs Scored Ranking')
# plot title
plt.title('Runs Scored Ranking of World Series Winner Over Time')

# n will be used for x values
n = list(range(1,16))
# creates list of frequencies of each value in range(1,16)
RunsCount = []
for l in range(1,16):
    RunsCount.append(RunsTableWinner6.count(l))
# sets size of chart
plt.figure(figsize=(15,5))
# plots graph
plt.bar(n, RunsCount, align='center')
# x axis range
plt.xlim(0,15)
# x axis label
plt.xlabel('Ranking')
# y axis range
plt.ylim(0,50)
# y axis label
plt.ylabel('Frequency')
# plot title
plt.title('Frequency of Runs Scored Ranking by World Series Winners')
# The following cells are the exact same process but for the other data sets. 
# All explanations and code are the exact same. Only the names of variables will change.


Out[768]:
<matplotlib.text.Text at 0x133aa36d8>

In [769]:
m = range(0,112)
plt.figure(figsize=(15,5))
plt.plot(m, RunsAgainstTableWinner6)
plt.xlim(0,113)
plt.xlabel("Years Since 1903")
plt.ylim(0,15)
plt.ylabel('Runs Against Ranking')
plt.title('Runs Against Ranking of World Series Winner Over Time')

n = list(range(1,16))
RunsAgainstCount = []
for l in range(1,16):
    RunsAgainstCount.append(RunsAgainstTableWinner6.count(l))
plt.figure(figsize=(15,5))
plt.bar(n, RunsAgainstCount, align='center')
plt.xlim(0,15)
plt.xlabel('Ranking')
plt.ylim(0,50)
plt.ylabel('Frequency')
plt.title('Frequency of Runs Against Ranking by World Series Winners')


Out[769]:
<matplotlib.text.Text at 0x133dee128>

Side note for following data

Because the variation of results in the next three data sets is much greater than in the preceding rankings, I divided each value by 10 and kept the integer part, creating "boxes" of values that fall between two multiples of 10 (a box labeled 6 holds values from 60 up to, but not including, 70, for example). I continued this process for the remaining data sets, changing little but the ranges.
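
The boxing step can be sketched in a few lines. This is an illustration rather than the code used in the cells that follow: the function name `bin_counts` is hypothetical, and the sample is the first six win percentages from the list printed earlier.

```python
from collections import Counter


def bin_counts(values, width=10):
    """Count how many values fall in each box of the given width.

    For non-negative values, int(v // width) gives the same box as int(v / width).
    """
    return Counter(int(v // width) for v in values)


# First six win percentages of World Series winners from the list above
sample = [65, 68, 60, 69, 64, 72]
counts = bin_counts(sample)
print(sorted(counts.items()))
```

Five of the six sample values land in box 6 (60 to 69) and one lands in box 7 (70 to 79).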


In [770]:
m = range(0,112)
plt.figure(figsize=(15,5))
plt.plot(m, WinsTableWinner6)
plt.xlim(0,113)
plt.xlabel("Years Since 1903")
plt.ylim(40,80)
plt.ylabel('Percent of Games Won')
plt.title('Win Percentage of World Series Winner Over Time')

# Process described above
q = range(3,9)
WinsCountDivide = list((int(x/10) for x in WinsTableWinner6))
WinsCount = []
for f in range(3,9):
    WinsCount.append(WinsCountDivide.count(f))
plt.figure(figsize=(15,5))
plt.bar(q, WinsCount, align='center')
plt.xlim(3,8)
plt.xlabel('Win Percentage, measured in 10% points')
plt.ylim(0,80)
plt.ylabel('Frequency')
plt.title('Frequency of Win Percentage by World Series Winners')


Out[770]:
<matplotlib.text.Text at 0x1341dcbe0>

In [771]:
m = range(0,112)
plt.figure(figsize=(15,5))
plt.plot(m, LastWSTableWinner6)
plt.xlim(0,113)
plt.xlabel("Years Since 1903")
plt.ylim(0,108)
plt.ylabel('Years Since Last World Series Win')
plt.title('Years Since Last World Series Win of World Series Winner Over Time')

# Group the waiting times into 10-year boxes and count each box
t = range(0,12)
LastWSCountDivide = [int(x/10) for x in LastWSTableWinner6]
LastWSCount = [LastWSCountDivide.count(f) for f in t]
plt.figure(figsize=(15,5))
plt.bar(t, LastWSCount, align='center')
plt.xlim(-1,11)
plt.xlabel('Years Since Last World Series Win, measured in 10 years')
plt.ylim(0,70)
plt.ylabel('Frequency')
plt.title('Frequency of Years Since Last World Series Win by World Series Winners')


Out[771]:
<matplotlib.text.Text at 0x133a35518>

In [772]:
m = range(0,27)
plt.figure(figsize=(15,5))
plt.plot(m, PayrollTableWinner6)
plt.xlim(0,27)
plt.xlabel("Years Since 1988")
plt.ylim(50,230)
plt.ylabel('Percentage of League Average Payroll')
plt.title('Percentage of League Average Payroll of World Series Winner Over Time')

# Group payroll percentages into 10-point boxes and count each box
t = range(6,23)
PayrollCountDivide = [int(x/10) for x in PayrollTableWinner6]
PayrollCount = [PayrollCountDivide.count(f) for f in t]
plt.figure(figsize=(15,5))
plt.bar(t, PayrollCount, align='center')
plt.xlim(5,23)
plt.xlabel('Payroll Percentage of League Average, measured in 10% points')
plt.ylim(0,5)
plt.ylabel('Frequency')
plt.title('Frequency of Payroll Percentage of League Average by World Series Winners')


Out[772]:
<matplotlib.text.Text at 0x13273e278>

Analyzing the Data

Runs Scored

Of all World Series winners, 63% have been ranked number 1 or number 2 in runs scored. That being said, the line graph suggests that teams with worse (numerically higher) rankings have had increasingly better odds of winning as the years have passed. One possible explanation is the large expansion that occurred in the 1960s: more teams simply meant there were more rankings other than 1 and 2.
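The 63% figure is a simple share of winners whose ranking falls in a given range. A minimal sketch of that calculation follows, assuming a short made-up list of rankings; `share_in_range` is a hypothetical helper, and in the notebook the input would be the list `RunsTableWinner6`.

```python
def share_in_range(values, low, high):
    """Return the percentage of values v with low <= v <= high."""
    hits = sum(1 for v in values if low <= v <= high)
    return 100 * hits / len(values)

# Illustrative rankings only; not the project's data.
sample_ranks = [1, 2, 1, 4, 2, 7, 1, 3]
print(share_in_range(sample_ranks, 1, 2))  # 62.5
```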

Runs Scored Against

The results for runs scored against are very similar: 68% of winners have been ranked number 1 or 2, with more variation in recent years. I thought it would be interesting to plot both lines on the same set of axes.


In [773]:
# plotting both lines together 
m = range(0,112)
plt.figure(figsize=(20,5))
plt.plot(m, RunsTableWinner6, 'r') # r changes the color 
plt.plot(m, RunsAgainstTableWinner6)
plt.xlim(0,113)
plt.xlabel("Years Since 1903") # this series starts in 1903, not 1988
plt.ylim(0,14)
plt.ylabel('Ranking')
plt.title('Runs Scored and Runs Allowed Ranking of World Series Winner Over Time')
# Runs Scored is Red, Runs Scored Against is Blue


Out[773]:
<matplotlib.text.Text at 0x131d52128>

Analyzing the Data

Runs Scored and Runs Scored Against

Looking at both lines together, it appears that when one ranking is numerically high, the other tends to be low. There are clear exceptions, but in general the pattern appears to hold.
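One way to put a number on this visual impression is a correlation coefficient between the two ranking series: a negative value would support the claim. The sketch below uses a hand-written Pearson correlation on made-up rankings; `pearson` is a hypothetical helper, and the real inputs would be `RunsTableWinner6` and `RunsAgainstTableWinner6`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Illustrative rankings only; not the project's data.
runs_rank = [1, 5, 2, 8, 1, 6]
runs_against_rank = [6, 1, 7, 2, 5, 1]

corr = pearson(runs_rank, runs_against_rank)
print(round(corr, 2))  # negative when one ranking is high while the other is low
```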

Wins

94% of winners have had a regular season win percentage between 50 and 70, and 63% have been between 60 and 70. In my opinion, these results are misleading. The season is 162 games long, so a team that wins 81 games is sitting at 50%, and a team that wins 100 games has a win percentage of 62%. There tend to be only a handful of teams that win 100 games in each season. Anything above 110 wins will go down in the record books: the current record for wins in a season is 116, set by the 1906 Cubs and the 2001 Mariners. The win percentages for those two teams were 76% and 72% respectively (the season was 10 games shorter in 1906). Accordingly, teams that win more games have a much greater chance of winning the World Series.
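The percentages quoted above follow directly from wins divided by games played. As a quick check, assuming 162 games for the modern season and 152 games for 1906 (the figure implied by "10 games shorter"), with `win_pct` as a hypothetical helper:

```python
def win_pct(wins, games):
    """Win percentage on a 0-100 scale."""
    return 100 * wins / games

print(round(win_pct(81, 162)))   # 50
print(round(win_pct(100, 162)))  # 62
print(round(win_pct(116, 162)))  # 72  (2001 Mariners)
print(round(win_pct(116, 152)))  # 76  (1906 Cubs, assuming 152 games)
```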

The problem with this conclusion is that it's obvious and almost pointless in the context of the baseball playoff system. If chosen at random, each team has about a 3% (1/30) chance of winning the World Series. However, once it qualifies for the playoffs, each team has about a 13% chance of winning (1/8). Only the teams with the best records make the playoffs, so in reality only the teams that have won more games even have a chance of winning the World Series. Accordingly, the claim that more wins result in a greater chance of a World Series win perhaps only applies to those few teams that win significantly more than 100 games.
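The two probabilities above are consistent with each other under the simplifying assumption that every team is equally likely to qualify and every playoff team is equally likely to win. A small sketch with exact fractions (the variable names are illustrative):

```python
from fractions import Fraction

teams = 30         # league size used in the text
playoff_teams = 8  # playoff field used in the text

p_random = Fraction(1, teams)                     # any team, before the season
p_given_playoffs = Fraction(1, playoff_teams)     # any team, once it has qualified
p_make_playoffs = Fraction(playoff_teams, teams)  # chance of qualifying at random

# Qualifying and then winning multiplies back to the unconditional 1/30.
assert p_make_playoffs * p_given_playoffs == p_random

print(f"{float(p_random):.1%}")          # 3.3%
print(f"{float(p_given_playoffs):.1%}")  # 12.5%
```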

Last World Series Win

60% of World Series winners had waited only between 0 and 4 years since their last title. Logically this makes sense: players are held to contracts that can last as long as 10 years, although 3 to 5 years are more standard lengths. So a team that is good enough to win the World Series has a better chance of being good in the near future than teams that are not currently good. As an aside, the line graph is slightly misleading. It looks as though the values belonging to early World Series winners were much smaller than more recent ones. That is simply because the World Series started in 1903, so a team that won its first World Series in 1907 will necessarily have a smaller value than a team that won its first in 1984.

Payroll

Since 1988, only 15% of World Series winners have been below the average league payroll. Clearly a sample of 27 values is not ideal, but the data is certainly consistent with trends seen in the game. Players make increasingly large sums of money, and a new emphasis has been placed on signing free agents. Only the teams with the largest monetary backing can consistently sign those free agents. Those teams tend to be the most popular ones as well: Yankees, Red Sox, Giants, Cardinals, etc. Not surprisingly, those teams are also the ones that have won the most World Series. Simply put, World Series wins lead to fans, which leads to monetary backing, which leads to signing free agents, which leads to World Series wins.
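The payroll measure used here is a team's payroll expressed as a percentage of the league average, so 100 means exactly average. A minimal sketch, assuming the league average is the plain mean over all teams including the team itself (the payroll figures and the helper name `pct_of_league_average` are hypothetical):

```python
def pct_of_league_average(team_payroll, league_payrolls):
    """Team payroll as a percentage of the mean league payroll."""
    league_avg = sum(league_payrolls) / len(league_payrolls)
    return 100 * team_payroll / league_avg

# Hypothetical payrolls in millions of dollars; not the project's data.
league = [60, 80, 100, 120, 140]
print(pct_of_league_average(140, league))  # 140.0
print(pct_of_league_average(80, league))   # 80.0
```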

Summary and Final Observation

After looking at the data, nothing was too surprising. The teams that perform the best during the regular season have the best chance of winning the World Series. That being said, if I were to present some sort of analytical guideline for making predictions, the following qualities would be required for a team:

Runs Rank: 1-2
Runs Against Rank: 1-2
(At least one of the two rankings should hold; both do not have to.)
Wins: If a team has significantly over 100 wins, choose it.
Years Since Last World Series: If a team has won the World Series in the last 4 years, choose it.
Payroll: Don't choose a team whose payroll is below the league average.

Although it is not a sufficient test of my recommendations, let's consider the previous 2 years. The winners were the Cubs in 2016 and the Royals in 2015. The Cubs' data, in the same order as above, is as follows: 2 runs, 1 runs against, 103 wins, 106 years, and 167.4% of league average. For the Cubs, 4 of my 5 guidelines worked. The data for the Royals is the following: 6, 3, 95, 28, 112%. My exact guidelines were 1 for 5 for the Royals. That being said, the team still had a solid balance of runs, runs against, and wins; they just slightly missed my cutoffs. The Royals were also an exception to typical league tendencies. They were a team of young players with only a couple of big names, and they caught much of the world by surprise. In fact, more and more surprising teams have emerged since the Oakland A's magical 2002 season that inspired the book and movie "Moneyball." That team threw out the idea of signing big free agents and replaced it with a mantra of getting more bang for your buck: signing players to serve very specific roles. While few teams have won the World Series using this model, it has gained popularity and produced highly competitive teams. As time goes on and the game evolves further, maybe even runs, wins, and payroll won't give any insight into future World Series winners.
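The two tallies quoted above (4 of 5 for the Cubs, 1 of 5 for the Royals) can be reproduced with a small checker. This is only a sketch: `guidelines_met` is a hypothetical helper, and the cutoff for "significantly over 100 wins" is not stated in the guidelines, so `min_wins=101` is an assumption chosen so that the Cubs' 103 wins pass, as the 4-of-5 tally implies.

```python
def guidelines_met(runs_rank, runs_against_rank, wins, years_since_title,
                   payroll_pct, min_wins=101, max_years=4):
    """Return a dict mapping each guideline to True/False for one team."""
    return {
        "runs rank 1-2": runs_rank <= 2,
        "runs against rank 1-2": runs_against_rank <= 2,
        "over 100 wins (assumed cutoff)": wins >= min_wins,
        "title within last 4 years": years_since_title <= max_years,
        "payroll at or above league average": payroll_pct >= 100,
    }

# Figures as quoted in the paragraph above.
cubs_2016 = guidelines_met(2, 1, 103, 106, 167.4)
royals_2015 = guidelines_met(6, 3, 95, 28, 112)

print(sum(cubs_2016.values()))    # 4
print(sum(royals_2015.values()))  # 1
```

Raising `min_wins` would drop the Cubs to 3 of 5, which shows how sensitive the tally is to where the wins cutoff is drawn.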